Corpora of Speech and Text Data Now Available

February 20, 2020

The NC State University Libraries now provides access to several corpora of speech and text data for non-commercial use. These corpora may be especially useful to researchers in natural language processing and linguistics.

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations, and government research laboratories that creates and distributes a wide array of language resources. The Libraries held a Standard Membership for 2019 and, through that membership and individual purchases, acquired perpetual access to five corpora from LDC's catalog.

The five corpora are:

TIDIGITS (more information: https://catalog.ldc.upenn.edu/LDC93S10)
OntoNotes (more information: https://catalog.ldc.upenn.edu/LDC2013T19)
Penn Discourse Treebank Version 3.0 (more information: https://catalog.ldc.upenn.edu/LDC2019T05)
CSR-I (WSJ0) Complete (more information: https://catalog.ldc.upenn.edu/LDC93S6A)
CSR-II (WSJ1) Complete (more information: https://catalog.ldc.upenn.edu/LDC94S13A)

Questions about this data, as well as requests for access to other corpora released in 2019, can be sent to Emily Cox, Collections & Research Librarian for Humanities, Social Sciences, & Digital Media. Purchase suggestions can be submitted using our Suggest a Purchase form.

Featured Staff

Heidi Tebbe
Former Collections & Research Librarian for Engineering and Data Science

